audio source




Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion

Kulkarni, Ajinkya, Dowerah, Sandipana, Alumae, Tanel, Magimai-Doss, Mathew

arXiv.org Artificial Intelligence

Audio deepfakes are acquiring an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system combining deep metric multi-class N-pair loss with Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion. The N-pair loss improves discriminative ability, while Real Emphasis and Fake Dispersion enhance robustness by focusing on differentiating real and fake speech patterns. The Conformer network captures both global and local dependencies in the audio signal, crucial for source tracing. The proposed ensemble score-embedding fusion shows an optimal trade-off between in-domain and out-of-domain source tracing scenarios. We evaluate our method using Frechet Distance and standard metrics, demonstrating superior performance in source tracing over the baseline system.
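The multi-class N-pair loss mentioned in the abstract pulls an anchor embedding toward a positive of the same class (here, the same source system) while pushing it away from negatives of other classes. A minimal NumPy sketch of that loss, not the authors' implementation (the function and variable names are ours):

```python
import numpy as np

def n_pair_loss(anchor, positive, negatives):
    """Multi-class N-pair loss for a single anchor.

    anchor:    (d,) embedding
    positive:  (d,) embedding from the same class (same source system)
    negatives: (n, d) embeddings from other classes
    """
    pos_sim = anchor @ positive        # similarity to the positive
    neg_sims = negatives @ anchor      # (n,) similarities to negatives
    # log(1 + sum_k exp(neg_k - pos)) == negative log-softmax of the positive
    return np.log1p(np.sum(np.exp(neg_sims - pos_sim)))
```

The loss shrinks as the anchor-positive similarity grows relative to the anchor-negative similarities, which is what gives the embedding space its discriminative structure for source tracing.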


Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

Jiang, Xilin, Han, Cong, Li, Yinghao Aaron, Mesgarani, Nima

arXiv.org Artificial Intelligence

In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Edit" (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to edit multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for editing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles it into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse editing tasks like extraction, removal, and volume control. Our experiments demonstrate significant improvements in signal quality across all editing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
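The final stage of the pipeline described above, applying a semantic filter to separated components and reassembling them, can be sketched as a per-source gain applied at re-mix time. This is an illustrative simplification of LCE, not its actual code; the function name and gain convention are assumptions:

```python
import numpy as np

def apply_semantic_filter(sources, gains):
    """Re-mix separated sources with per-source gains.

    sources: (n_sources, n_samples) array of separated waveforms
    gains:   (n_sources,) array; 0.0 removes a source, 1.0 keeps it,
             and other values implement volume control
    """
    return (gains[:, None] * sources).sum(axis=0)
```

Extraction then corresponds to a gain vector with a single 1.0, removal to zeroing one entry, and volume control to fractional gains, with all edits applied in a single re-mix.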


Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Tan, Reuben, Ray, Arijit, Burns, Andrea, Plummer, Bryan A., Salamon, Justin, Nieto, Oriol, Russell, Bryan, Saenko, Kate

arXiv.org Artificial Intelligence

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to annotations during training. To overcome this challenge, we adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions and encourage a stronger alignment between the audio, visual and natural language modalities. During inference, our approach can separate sounds given text, video and audio input, or given text and audio input alone. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets, including MUSIC, SOLOS and AudioSet, where we outperform state-of-the-art strongly supervised approaches despite not using object detectors or text labels during training.


Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

Mukhopadhyay, Soumik, Suri, Saksham, Gadde, Ravi Teja, Shrivastava, Abhinav

arXiv.org Artificial Intelligence

The task of lip synchronization (lip-sync) seeks to match the lips of human faces with different audio. It has various applications in the film industry, as well as for creating virtual avatars and for video conferencing. This is a challenging problem, as one needs to introduce detailed, realistic lip movements while simultaneously preserving the identity, pose, emotions, and image quality. Many previous methods attempting to solve this problem suffer from image-quality degradation due to a lack of complete contextual information. In this paper, we present Diff2Lip, an audio-conditioned diffusion-based model that performs lip synchronization in-the-wild while preserving these qualities. We train our model on Voxceleb2, a video dataset containing in-the-wild talking face videos. Extensive studies show that our method outperforms popular methods like Wav2Lip and PC-AVS on the Fréchet inception distance (FID) metric and in user Mean Opinion Scores (MOS). We show results in both the reconstruction (same audio-video inputs) and cross (different audio-video inputs) settings on the Voxceleb2 and LRW datasets. Video results and code can be accessed from our project page ( https://soumik-kanad.github.io/diff2lip ).
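The Fréchet distance that underlies the FID metric mentioned above compares two Gaussians fitted to feature statistics. The full metric requires a matrix square root over full covariances; the sketch below is a simplified diagonal-covariance version (our own illustration, not the paper's evaluation code), where the matrix square root reduces to an elementwise square root:

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances.

    General form: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}).
    With diagonal C1, C2 the trace term reduces to
    sum((sqrt(var1) - sqrt(var2))^2).
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term
```

Identical distributions give a distance of zero, and the score grows with both mean shifts and variance mismatches between the two feature sets.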


Speech recognition using python

#artificialintelligence

Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to text. You have probably seen it in sci-fi, and in personal assistants like Siri, Cortana, and Google Assistant, and other virtual assistants that you interact with through voice. To understand your voice, these AI assistants need to perform speech recognition on what you have just said. Speech recognition is a complex process; I'm not going to teach you how to train a machine learning or deep learning model for it. Instead, I will show you how to do it using the Google speech recognition API. As long as you know the basics of Python, you can complete this tutorial and build your own fully functioning speech recognition programs in Python.
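A typical way to do this in Python is through the third-party `SpeechRecognition` package (`pip install SpeechRecognition`), whose `recognize_google` method sends audio to Google's free web speech API. A minimal sketch, assuming a 16-bit WAV file; the file name is a placeholder:

```python
def transcribe(wav_path: str) -> str:
    """Transcribe a WAV file via the Google web speech API."""
    # Third-party dependency: pip install SpeechRecognition
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the entire file
    # Sends the audio over the network to Google's recognizer
    return recognizer.recognize_google(audio)

if __name__ == "__main__":
    print(transcribe("hello.wav"))  # "hello.wav" is a placeholder path
```

Note that `recognize_google` requires an internet connection and raises `sr.UnknownValueError` when the audio cannot be understood, so production code should wrap the call in a try/except.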


Q Acoustics Q Active 200 review: This high-end powered bookshelf audio system delivers impeccable performance

PCWorld

Q Acoustics builds mighty-fine loudspeakers, and for its first self-powered offering, the company could have modified any of its existing designs by bolting on an amplifier and calling it a day. What it has wrought instead is a complete high-end audio system that can accommodate nearly any source: analog or digital, wired or wireless, streaming or locally sourced; one that can be incorporated into any of the most common home-audio and smart-home ecosystems. The Q Active 200 system consists of a pair of self-amplified, wireless two-way bookshelf speakers and the Q Active Control Hub (the company will soon offer the same technology in a tower speaker system, the Q Active 400). The audio sources the Hub can handle range from a server on your network, to most of the popular streaming services, to a turntable equipped with a moving-magnet cartridge. It can then send that music both to its own speakers and to other audio systems on your network, using Apple AirPlay 2 or Google Chromecast.


OtoWorld: Towards Learning to Separate by Learning to Move

Ranadive, Omkar, Gasser, Grant, Terpay, David, Seetharaman, Prem

arXiv.org Machine Learning

We present OtoWorld, an interactive environment in which agents must learn to listen in order to solve navigational tasks. The purpose of OtoWorld is to facilitate reinforcement learning research in computer audition, where agents must learn to listen to the world around them to navigate. OtoWorld is built on three open source libraries: OpenAI Gym for environment and agent interaction, PyRoomAcoustics for ray-tracing and acoustics simulation, and nussl for training deep computer audition models. OtoWorld is the audio analogue of GridWorld, a simple navigation game. OtoWorld can be easily extended to more complex environments and games. To solve one episode of OtoWorld, an agent must move towards each sounding source in the auditory scene and "turn it off". The agent receives no other input than the current sound of the room. The sources are placed randomly within the room and can vary in number. The agent receives a reward for turning off a source. We present preliminary results on the ability of agents to win at OtoWorld. OtoWorld is open-source and available.
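The episode structure described above (move toward each sounding source and "turn it off" for a reward, observing only what the agent hears) can be illustrated with a toy 1-D environment. This is a pure-Python analogue for intuition only, not the actual OtoWorld library, which builds on OpenAI Gym, PyRoomAcoustics, and nussl; all names here are ours:

```python
import random

class ToyOtoWorld:
    """Minimal 1-D analogue of an OtoWorld episode (illustrative only).

    The agent observes a single loudness cue that decays with distance
    to the nearest active source; the episode ends when every source
    has been reached and turned off.
    """

    def __init__(self, n_sources=2, size=10, seed=0):
        rng = random.Random(seed)
        self.size = size
        # Sources placed at random positions (never at the start cell)
        self.sources = sorted(rng.sample(range(1, size), n_sources))
        self.pos = 0

    def _observation(self):
        if not self.sources:
            return 0.0  # silence: the room has been cleared
        return 1.0 / (1 + min(abs(self.pos - s) for s in self.sources))

    def step(self, action):
        """action: -1 (move left) or +1 (move right).

        Returns (observation, reward, done)."""
        self.pos = max(0, min(self.size - 1, self.pos + action))
        reward = 0.0
        if self.pos in self.sources:
            self.sources.remove(self.pos)  # "turn it off"
            reward = 1.0
        return self._observation(), reward, not self.sources
```

Even in this toy version, the key property survives: the only signal available for navigation is the sound-derived observation, so an agent must learn to follow loudness gradients to collect rewards.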


Generate video from any given audio source

#artificialintelligence

This paper presents a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.